Checking and Cleaning the data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import matplotlib.patches as mpatches
import itertools

%matplotlib inline 
%config InlineBackend.figure_format = 'retina'
In [2]:
df = pd.read_csv('no_show_appointments.csv')
df.head(3)
Out[2]:
PatientId AppointmentID Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received No-show
0 2.987250e+13 5642903 F 2016-04-29T18:38:08Z 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0 1 0 0 0 0 No
1 5.589978e+14 5642503 M 2016-04-29T16:08:27Z 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 0 0 0 0 0 No
2 4.262962e+12 5642549 F 2016-04-29T16:19:04Z 2016-04-29T00:00:00Z 62 MATA DA PRAIA 0 0 0 0 0 0 No
In [3]:
## Fixing column names

df.rename(mapper = lambda x: x.lower().replace('-', '_').replace('id', '_id').replace('day', '_day'), 
          axis = 1, 
          inplace= True)
df.head(1) 
Out[3]:
patient_id appointment_id gender scheduled_day appointment_day age neighbourhood scholarship hipertension diabetes alcoholism handcap sms_received no_show
0 2.987250e+13 5642903 F 2016-04-29T18:38:08Z 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0 1 0 0 0 0 No
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
patient_id         110527 non-null float64
appointment_id     110527 non-null int64
gender             110527 non-null object
scheduled_day      110527 non-null object
appointment_day    110527 non-null object
age                110527 non-null int64
neighbourhood      110527 non-null object
scholarship        110527 non-null int64
hipertension       110527 non-null int64
diabetes           110527 non-null int64
alcoholism         110527 non-null int64
handcap            110527 non-null int64
sms_received       110527 non-null int64
no_show            110527 non-null object
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB
In [5]:
## Casting patient_id and appointment_id to strings,
## as int or float dtypes are misleading for identifiers

df['patient_id'] = df['patient_id'].apply(lambda x: str(int(x)))

df['appointment_id'] = df['patient_id'].apply(lambda x: str(x))  ## note: sourced from patient_id
In [6]:
## unique values for every column

for column in ['diabetes', 'handcap', 'sms_received', 'alcoholism', 'hipertension', 'no_show']:
    print('{:<18}'.format(column), pd.unique(df[column]))
diabetes           [0 1]
handcap            [0 1 2 3 4]
sms_received       [0 1]
alcoholism         [0 1]
hipertension       [1 0]
no_show            ['No' 'Yes']
In [7]:
def check_for_empty_str(df):
    """
    Check string-typed (object) columns for empty strings,
    since pandas' null checks only catch None/NaN while an
    empty string still counts as a value.
    """
    print('\nChecking for empty strings:\n')
    for column in df.columns:
        if df[column].dtype == 'O':
            print('{:<18}'.format(column), df[column].apply(lambda x: x.strip() == '').any())
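A vectorized alternative to the row-wise `apply` would use the `.str` accessor; here is a sketch on a small toy frame (the column names and values are made up for illustration):

```python
import pandas as pd

# Toy frame standing in for df; '  ' is whitespace-only, so it should be flagged.
toy = pd.DataFrame({'name': ['Ana', '  ', 'Bo'], 'city': ['X', 'Y', 'Z']})

# .str.strip().eq('') is the vectorized equivalent of apply(lambda x: x.strip() == '')
flags = {col: toy[col].str.strip().eq('').any()
         for col in toy.columns if toy[col].dtype == 'O'}
print(flags)  # {'name': True, 'city': False}
```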
In [8]:
print(df.isna().any())
check_for_empty_str(df)
patient_id         False
appointment_id     False
gender             False
scheduled_day      False
appointment_day    False
age                False
neighbourhood      False
scholarship        False
hipertension       False
diabetes           False
alcoholism         False
handcap            False
sms_received       False
no_show            False
dtype: bool

Checking for empty strings:

patient_id         False
appointment_id     False
gender             False
scheduled_day      False
appointment_day    False
neighbourhood      False
no_show            False
In [9]:
## Converting the categorical no_show and gender columns to numerical values for analysis

df['no_show'] = (df['no_show'] == 'Yes').astype(int)
df['gender'] = (df['gender'] == 'M').astype(int)
df.head(3)
Out[9]:
patient_id appointment_id gender scheduled_day appointment_day age neighbourhood scholarship hipertension diabetes alcoholism handcap sms_received no_show
0 29872499824296 29872499824296 0 2016-04-29T18:38:08Z 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0 1 0 0 0 0 0
1 558997776694438 558997776694438 1 2016-04-29T16:08:27Z 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 0 0 0 0 0 0
2 4262962299951 4262962299951 0 2016-04-29T16:19:04Z 2016-04-29T00:00:00Z 62 MATA DA PRAIA 0 0 0 0 0 0 0
In [10]:
df.shape
Out[10]:
(110527, 14)
In [11]:
## Getting the number of unique values in every column, plus checking for
## duplicate values in appointment_id (there shouldn't be any duplicates)

print(
    df.nunique(),
    'Duplicates:\n{:<19}{}'.format('appointment_id', df['appointment_id'].duplicated().any()),
    sep = '\n\n'
)
patient_id          62299
appointment_id      62299
gender                  2
scheduled_day      103549
appointment_day        27
age                   104
neighbourhood          81
scholarship             2
hipertension            2
diabetes                2
alcoholism              2
handcap                 5
sms_received            2
no_show                 2
dtype: int64

Duplicates:
appointment_id     True
In [12]:
df.shape[0]
Out[12]:
110527
In [13]:
## checking the number of duplicate rows and 
## dropping them since they carry no additional information

print(df.duplicated().sum())
df.drop_duplicates(inplace=True)
df.shape[0]
618
Out[13]:
109909
In [14]:
## duplicates in appointment_id

df.duplicated('appointment_id').sum()
Out[14]:
47610

Since appointment_id isn't unique, the column name itself is suspect: generally speaking, a patient can have multiple appointments, but multiple patients can't attend the same appointment. I'll investigate the values of appointment_id further.
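The check the next cells perform can be sketched on a toy example: count the distinct patient_ids per appointment_id, where any count above 1 would confirm the column cannot be a true appointment identifier (toy data, hypothetical IDs):

```python
import pandas as pd

# Toy data: 'A1' appears twice with the same patient (a repeat booking), while
# 'A2' is shared by two different patients, which would be impossible for a
# genuine appointment identifier.
toy = pd.DataFrame({'appointment_id': ['A1', 'A1', 'A2', 'A2'],
                    'patient_id':     ['P1', 'P1', 'P2', 'P3']})

patients_per_appt = toy.groupby('appointment_id')['patient_id'].nunique()
print(patients_per_appt[patients_per_appt > 1])  # flags 'A2'
```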

In [15]:
dup_app_ids = df[df.duplicated('appointment_id', keep = False)].copy()
print(dup_app_ids.shape[0])
dup_app_ids.sort_values('appointment_id', inplace=True)
dup_app_ids.head(10)
71843
Out[15]:
patient_id appointment_id gender scheduled_day appointment_day age neighbourhood scholarship hipertension diabetes alcoholism handcap sms_received no_show
51059 111124532532143 111124532532143 1 2016-03-31T09:17:26Z 2016-05-03T00:00:00Z 9 MARIA ORTIZ 0 0 0 0 0 1 0
95359 111124532532143 111124532532143 1 2016-04-26T15:05:58Z 2016-06-01T00:00:00Z 9 MARIA ORTIZ 0 0 0 0 0 1 1
28351 1111633122891 1111633122891 1 2016-05-09T11:03:50Z 2016-05-09T00:00:00Z 2 SÃO JOSÉ 0 0 0 0 0 0 0
17494 1111633122891 1111633122891 1 2016-05-18T09:37:39Z 2016-05-18T00:00:00Z 2 SÃO JOSÉ 0 0 0 0 0 0 0
34823 1111633122891 1111633122891 1 2016-05-11T08:45:03Z 2016-05-11T00:00:00Z 2 SÃO JOSÉ 0 0 0 0 0 0 0
31521 111311342971 111311342971 1 2016-05-09T11:13:45Z 2016-05-25T00:00:00Z 0 SANTA TEREZA 0 0 0 0 0 0 0
31622 111311342971 111311342971 1 2016-04-25T11:06:11Z 2016-05-09T00:00:00Z 0 SANTA TEREZA 0 0 0 0 0 0 0
26089 111346221769879 111346221769879 1 2016-05-30T09:06:32Z 2016-05-30T00:00:00Z 76 SANTA MARTHA 0 0 0 0 0 0 0
98614 111346221769879 111346221769879 1 2016-05-03T09:15:31Z 2016-06-03T00:00:00Z 76 SANTA MARTHA 0 0 0 0 0 1 1
105717 111415561531417 111415561531417 0 2016-05-30T15:55:49Z 2016-06-03T00:00:00Z 45 SANTA MARTHA 0 0 0 0 0 1 0

Looking closely at the appointment_id and patient_id columns in the sample above, their values appear identical. That follows from In [5], where appointment_id was filled from patient_id, but I'll still check whether it holds across the entire data.

In [16]:
## Checking if all appointment_ids equals their corresponding patient_ids

(df['patient_id'] == df['appointment_id']).all()
Out[16]:
True

The appointment_id column now merely duplicates patient_id, so I'll drop it.

In [17]:
df.drop('appointment_id', axis = 1, inplace=True)
df.head(1)
Out[17]:
patient_id gender scheduled_day appointment_day age neighbourhood scholarship hipertension diabetes alcoholism handcap sms_received no_show
0 29872499824296 0 2016-04-29T18:38:08Z 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0 1 0 0 0 0 0
In [18]:
df.shape[0]
Out[18]:
109909
In [19]:
df.nunique()
Out[19]:
patient_id          62299
gender                  2
scheduled_day      103549
appointment_day        27
age                   104
neighbourhood          81
scholarship             2
hipertension            2
diabetes                2
alcoholism              2
handcap                 5
sms_received            2
no_show                 2
dtype: int64
In [20]:
## Rechecking that patient_id covers the entire data.
## Note: checking with isnull gives the same answer, but I wanted to observe
## the different chunks of patient_id (unique vs. duplicated)

n_unique_ids = df.nunique()['patient_id']
print(n_unique_ids)

all_dup = df.duplicated('patient_id', keep = False).sum()

unique_dup_ids =  all_dup - df.duplicated('patient_id').sum() 

n_unique_ids - unique_dup_ids + all_dup == df.shape[0]
62299
Out[20]:
True

I came across the following investigation by chance. While checking duplicates in the appointment_id column, I meant to slice the data down to the "inherent" or slow-to-change patient characteristics ('patient_id', 'gender', 'age'), but df.loc[:, :'age'] slices by column position, so it also pulled in 'scheduled_day' and 'appointment_day', and the slice still contained duplicates.
That surprised me: the datetime columns, especially 'scheduled_day' with its second-level precision, should almost never repeat. I then kept widening the slice one column at a time, and the duplicate count only dropped once 'sms_received' was included (and reached 0 once 'no_show' was included). My reading is that these rows were re-entered, so I'll eventually drop the earlier occurrences, keeping only the last pushed state of 'sms_received'.
Below are the steps I took.

In [21]:
def get_numof_duplicates_till_column(df, end_column):
    """
    Slice df up to and including end_column, flag duplicate rows
    (all occurrences), and return the boolean mask together with
    the number of duplicates.
    """
    dup_df = df.loc[:, :end_column].duplicated(keep = False)
    return dup_df, dup_df.sum()
In [22]:
print(
    '%-18s' % 'age' + '%d' % get_numof_duplicates_till_column(df, 'age')[1],
    '%-18s' % 'handcap' + '%d' % get_numof_duplicates_till_column(df, 'handcap')[1],
    '%-18s' % 'sms_received' + '%d' % get_numof_duplicates_till_column(df, 'sms_received')[1],
    sep = '\n'
)
age               1432
handcap           1432
sms_received      28

I'll drop the duplicated rows in the slice up to 'handcap' (everything before 'sms_received'), keeping the last occurrences, as those were pushed to the system last.
(The operator may have realized a mistake in the 'sms_received' or 'no_show' value and decided to re-enter the data.)
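The keep='last' semantics relied on here can be illustrated on a toy frame (hypothetical values):

```python
import pandas as pd

toy = pd.DataFrame({'patient': ['P1', 'P1', 'P2'], 'sms': [0, 0, 1]})

# keep='last' flags every occurrence except the final one, so filtering with
# ~mask keeps only the most recently entered row of each duplicate group.
mask = toy.duplicated(keep='last')
print(mask.tolist())          # [True, False, False]
print(toy[~mask].reset_index(drop=True))
```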

In [23]:
duplicates = df.loc[:, :'handcap'].duplicated(keep = 'last')
print(duplicates.sum())
print('\nSame Size?   %s'% (duplicates.shape[0] == df.shape[0]))
df[duplicates].sort_values('patient_id').head(4)
716

Same Size?   True
Out[23]:
patient_id gender scheduled_day appointment_day age neighbourhood scholarship hipertension diabetes alcoholism handcap sms_received no_show
101631 1115255573477 1 2016-05-19T17:24:01Z 2016-06-01T00:00:00Z 60 CENTRO 0 0 0 0 0 0 0
72618 113952483828156 0 2016-05-04T15:03:27Z 2016-05-31T00:00:00Z 41 DO MOSCOSO 1 0 0 0 0 1 0
88277 114895395443 0 2016-05-16T07:46:55Z 2016-06-03T00:00:00Z 32 ITARARÉ 0 0 0 0 0 1 0
74802 11558399198834 0 2016-04-27T13:14:20Z 2016-05-10T00:00:00Z 28 ILHA DAS CAIEIRAS 1 0 0 0 0 0 0
In [24]:
## dropping the duplicates

df = df[~duplicates]
df.shape
Out[24]:
(109193, 13)
In [25]:
## Double checking for any more duplicates 
get_numof_duplicates_till_column(df, 'sms_received')[1]
Out[25]:
0
In [26]:
## Resetting index
df.reset_index(drop = True, inplace=True)
df.head(3)
Out[26]:
patient_id gender scheduled_day appointment_day age neighbourhood scholarship hipertension diabetes alcoholism handcap sms_received no_show
0 29872499824296 0 2016-04-29T18:38:08Z 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0 1 0 0 0 0 0
1 558997776694438 1 2016-04-29T16:08:27Z 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 0 0 0 0 0 0
2 4262962299951 0 2016-04-29T16:19:04Z 2016-04-29T00:00:00Z 62 MATA DA PRAIA 0 0 0 0 0 0 0

=========================================================================================================================
=========================================================================================================================
=========================================================================================================================

Since the hour of appointment_day is unspecified in every row (all are 00:00:00), the hour of scheduled_day loses its meaning too: all appointments obviously can't share the same hour, so categorizing by hour of day, or measuring the waiting time in hours for patients who book on the day of the appointment, would not be trustworthy. What does matter is the interval between scheduled_day and appointment_day: whether they fall on the same day, and whether the gap is long enough that the patient may forget or neglect the appointment.
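As a sketch, an equivalent way to discard the unused time-of-day component is to parse the full timestamps and call `dt.normalize()`, instead of slicing the date out of the strings as done below:

```python
import pandas as pd

# Two sample timestamps in the dataset's ISO format.
s = pd.Series(['2016-04-29T18:38:08Z', '2016-04-29T16:08:27Z'])

# Parse the full timestamps, then zero out the time component.
days = pd.to_datetime(s).dt.normalize()
print(days.dt.strftime('%Y-%m-%d').tolist())  # ['2016-04-29', '2016-04-29']
```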

In [27]:
## Converting dates columns from string to datetime objects

df['scheduled_day'] = pd.to_datetime(df['scheduled_day'].str[:10], format = '%Y-%m-%d')
df['appointment_day'] = pd.to_datetime(df['appointment_day'].str[:10], format = '%Y-%m-%d')
In [28]:
df['appointment_day'].dtype
Out[28]:
dtype('<M8[ns]')
In [29]:
## Deriving the interval from scheduled_day to appointment_day

df['days'] = df['appointment_day'] - df['scheduled_day']
df['days'] = df['days'].dt.days
df.describe()
Out[29]:
gender age scholarship hipertension diabetes alcoholism handcap sms_received no_show days
count 109193.000000 109193.000000 109193.000000 109193.000000 109193.000000 109193.000000 109193.000000 109193.000000 109193.000000 109193.000000
mean 0.350050 37.097003 0.098230 0.197375 0.071864 0.030496 0.022199 0.320597 0.200251 10.065746
std 0.476987 23.140858 0.297626 0.398020 0.258263 0.171950 0.161109 0.466708 0.400190 15.200463
min 0.000000 -1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -6.000000
25% 0.000000 18.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 37.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4.000000
75% 1.000000 55.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 14.000000
max 1.000000 115.000000 1.000000 1.000000 1.000000 1.000000 4.000000 1.000000 1.000000 179.000000
In [30]:
## Checking the number of invalid (negative) values in the age column

df[df['age'] < 0].shape[0]
Out[30]:
1
In [31]:
## Checking if the same patient made other appointments
df[df['patient_id'] == df[df['age'] < 0].reset_index()['patient_id'][0]].shape[0]
Out[31]:
1

Age can't be negative, so we'll remove the corresponding entry, especially since there is only one and its patient_id appears nowhere else in the data, leaving no way to correct it.
Examining the days column, where min('days') == -6 < 0: some entries are clearly invalid due to technical or human error, since for them the appointment_day precedes the scheduled_day, which can't happen. I wouldn't be surprised if none of these patients attended the appointment. Let's check that anyway.

In [32]:
## Checking whether any patient with a negative days interval attended

print('Did any patient with an invalid appointment day attend the appointment? \n{} as expected'.format(
    (df[df['days'] < 0]['no_show'] == 0).any())
     )
Did any patient with an invalid appointment day attend the appointment? 
False as expected
In [33]:
## Removing the false rows corresponding to negative days interval 
## between the appointment day and scheduled day

df = df.query('days >= 0').reset_index(drop = True)
(df['days'] < 0).any()
Out[33]:
False
In [34]:
## Removing rows corresponding to negative ages

print('%-12s'%'Before',(df['age']<0).any())
df = df[df['age'] >= 0]
print('%-12s'%'After',(df['age']<0).any())
Before       True
After        False
In [35]:
df.shape
Out[35]:
(109187, 14)
In [36]:
## final check

df.describe()
Out[36]:
gender age scholarship hipertension diabetes alcoholism handcap sms_received no_show days
count 109187.000000 109187.000000 109187.000000 109187.000000 109187.000000 109187.000000 109187.000000 109187.000000 109187.000000 109187.000000
mean 0.350051 37.097521 0.098235 0.197386 0.071868 0.030498 0.022182 0.320615 0.200216 10.066391
std 0.476988 23.140535 0.297634 0.398028 0.258270 0.171954 0.161059 0.466715 0.400164 15.200625
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 18.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 37.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4.000000
75% 1.000000 55.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 14.000000
max 1.000000 115.000000 1.000000 1.000000 1.000000 1.000000 4.000000 1.000000 1.000000 179.000000
In [37]:
df.head(3)
Out[37]:
patient_id gender scheduled_day appointment_day age neighbourhood scholarship hipertension diabetes alcoholism handcap sms_received no_show days
0 29872499824296 0 2016-04-29 2016-04-29 62 JARDIM DA PENHA 0 1 0 0 0 0 0 0
1 558997776694438 1 2016-04-29 2016-04-29 56 JARDIM DA PENHA 0 0 0 0 0 0 0 0
2 4262962299951 0 2016-04-29 2016-04-29 62 MATA DA PRAIA 0 0 0 0 0 0 0 0

Exploration

Questions:

1. Is there a neighborhood with a significant absence rate?
2. Does age affect appointment attendance?
3. Does scheduling an appointment far ahead of its date affect attendance?
4. Do people attend more appointments on weekends than on workdays?
5. What about people with special conditions (diabetes, hypertension, handcap)?
6. What about SMS recipients?
In [38]:
def groupby_and_agg_by_column(df, groupby_column, agg_column = 'no_show', 
                              agg_functions = ['sum', 'mean', 'count']):
    """
    Group df by groupby_column and aggregate agg_column (defaulting in
    this analysis to 'no_show') with agg_functions (defaulting to
    ['sum', 'mean', 'count']); returns the aggregated dataframe with
    flattened column names.
    """
    group_df = df.groupby(groupby_column, as_index=False).agg({agg_column: agg_functions})
    group_df.columns = group_df.columns.droplevel(0)
    group_df.columns = list([groupby_column] if type(groupby_column) == str else groupby_column) + list(group_df.columns[-3:])
    return group_df

Investigate by neighborhood

First, I will try to investigate absence by the mean, since the mean reflects the probability of missing an appointment in each neighborhood.

In [39]:
## Grouping by neighbourhood, aggregating, sorting descending by mean, then by sum of misses

neigh_df = groupby_and_agg_by_column(df, 'neighbourhood')

neigh_df.sort_values(['mean', 'sum'], ascending=[0,0], inplace=True)
neigh_df.reset_index(drop = True, inplace= True)
neigh_df.head(10)
Out[39]:
neighbourhood sum mean count
0 ILHAS OCEÂNICAS DE TRINDADE 2 1.000000 2
1 SANTOS DUMONT 366 0.289557 1264
2 SANTA CECÍLIA 123 0.275785 446
3 SANTA CLARA 130 0.260521 499
4 ITARARÉ 866 0.254108 3408
5 JESUS DE NAZARETH 695 0.243689 2852
6 HORTO 42 0.240000 175
7 ILHA DO PRÍNCIPE 532 0.234879 2265
8 CARATOÍRA 580 0.229340 2529
9 PRAIA DO SUÁ 294 0.228261 1288
In [40]:
neigh_df.tail(10)
Out[40]:
neighbourhood sum mean count
71 JARDIM DA PENHA 631 0.162965 3872
72 DO QUADRO 135 0.160333 842
73 DO CABRAL 88 0.157706 558
74 DE LOURDES 47 0.156146 301
75 SANTA MARTHA 476 0.155403 3063
76 SOLON BORGES 69 0.147122 469
77 MÁRIO CYPRESTE 52 0.142466 365
78 AEROPORTO 1 0.125000 8
79 ILHA DO BOI 3 0.085714 35
80 PARQUE INDUSTRIAL 0 0.000000 1

Although the mean captures the probability of missing well, some neighborhoods have very few appointments compared to the others, so the mean there says little about the actual number of misses. For example, ILHAS OCEÂNICAS DE TRINDADE has a 100% miss rate, but that represents only 2 appointments; likewise PARQUE INDUSTRIAL has a 0% miss rate from a single appointment.
Such probabilities are weak because each additional attended appointment would change the mean drastically (e.g., .67, .5, .4, .333, etc. in the case of ILHAS OCEÂNICAS DE TRINDADE). I'll therefore investigate the neighborhoods with the highest number of misses, as their means are more stable.

I will remove any neighborhood with count < 50, treating them as outliers: in these neighborhoods, any change in misses moves the mean drastically.
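The instability argument can be illustrated numerically: starting from the 2-of-2 misses of ILHAS OCEÂNICAS DE TRINDADE above and adding attended appointments one at a time, the mean swings exactly through the values quoted earlier:

```python
# 2 misses out of 2 appointments gives a mean of 1.0; each additional
# attended appointment grows the denominator and drags the mean down fast.
misses, total = 2, 2
means = []
for _ in range(4):
    total += 1            # one more attended appointment
    means.append(round(misses / total, 3))
print(means)  # [0.667, 0.5, 0.4, 0.333]
```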

In [41]:
neigh_df = neigh_df.query('count >= 50').reset_index(drop = True)
neigh_df.describe()
Out[41]:
sum mean count
count 76.000000 76.000000 76.000000
mean 287.539474 0.199366 1435.934211
std 269.206708 0.028219 1343.938782
min 12.000000 0.142466 69.000000
25% 87.250000 0.179523 435.750000
50% 195.500000 0.197022 1042.000000
75% 430.750000 0.216159 2143.250000
max 1401.000000 0.289557 7510.000000
In [42]:
neigh_df.sort_values(['sum', 'mean'], ascending=[0,0], inplace=True)
neigh_df.head(10)
Out[42]:
neighbourhood sum mean count
48 JARDIM CAMBURI 1401 0.186551 7510
26 MARIA ORTIZ 1219 0.210028 5804
3 ITARARÉ 866 0.254108 3408
36 RESISTÊNCIA 850 0.197307 4308
4 JESUS DE NAZARETH 695 0.243689 2852
25 CENTRO 689 0.210189 3278
69 JARDIM DA PENHA 631 0.162965 3872
7 CARATOÍRA 580 0.229340 2529
54 TABUAZEIRO 563 0.181379 3104
33 BONFIM 548 0.198983 2754
Top 10 neighborhoods by appointment misses.
In [43]:
plt.figure(figsize = (10,10))
ax = plt.subplot(111)
ax.xaxis.set_label_position('top')
ax.xaxis.tick_top()
plt.grid(True, alpha = .5, axis = 'x', zorder = 3.5, c = 'gray', linestyle = '--')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
container = plt.barh(range(0, 10), neigh_df['sum'][:10], zorder = 2.5)
means_max = neigh_df['mean'][:10].max()

X = list(range(10))
opacity = list(np.linspace(1,.2,10))
Y = list(neigh_df['mean'][:10])
order = [x for _,x in sorted(zip(Y,X), reverse = True)]
opacity = dict(zip(order,opacity))
for i in range(len(container.patches)):
    container.patches[i].set_alpha(opacity[i])

    
plt.yticks(range(0, 10), neigh_df['neighbourhood'][:10])
plt.ylabel('Neighbourhood\n', fontsize = 14)
plt.xlabel('Appointments Misses Count\n', fontsize = 14)
plt.title("\n\nTop 10 Appointment Misses by Neighbourhood\n\n\n\n", fontsize = 16)

plt.annotate('The opacity of columns represent the mean of misses\n with respect to maximum mean in the group\n(ITARARE neighborhood)', xy= np.array([850, 2]), xytext=(300,-100),
             ha='center', textcoords='offset points', bbox=dict(boxstyle='round,pad=0.5', fc='gray', alpha=0.1),
             arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0.5',color='gray'))

plt.gca().invert_yaxis()
plt.show()

Taking roughly 0.2 as the average miss rate across neighborhoods, JARDIM DA PENHA, TABUAZEIRO and JARDIM CAMBURI fall below the average, suggesting medical awareness in those neighborhoods may be better than in most others. Meanwhile, neighborhoods such as CARATOÍRA, JESUS DE NAZARETH and especially ITARARÉ sit more than one standard deviation above the average, perhaps reflecting some indifference there toward medical conditions.
This suggests medical-awareness campaigns should be concentrated on such areas.
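The one-standard-deviation claim can be checked mechanically; here is a sketch on a handful of the rates from the tables above (rounded, accent-free names for brevity; a real check would use the full neigh_df, whose mean and std differ from this subset's):

```python
import pandas as pd

# Hypothetical subset of miss rates standing in for neigh_df['mean'].
rates = pd.Series({'ITARARE': 0.254, 'CARATOIRA': 0.229, 'JARDIM DA PENHA': 0.163,
                   'CENTRO': 0.210, 'TABUAZEIRO': 0.181})

# Flag neighborhoods more than one standard deviation above the mean rate.
threshold = rates.mean() + rates.std()
high = rates[rates > threshold].index.tolist()
print(high)  # ['ITARARE']
```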

Investigate by Age

In [44]:
def sytle_plot_spines(*axes):
    """
    Takes axes objects and applies my preferred spine style to each.
    """
    for ax in axes:
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)
        ax.spines['left'].set_alpha(.6)
        ax.spines['bottom'].set_alpha(.6)

First, I'll take a general look at the appointment distributions with respect to age.

In [45]:
plt.figure(figsize = (12, 4))
ax1 = plt.subplot(121)
plt.hist(df.query('no_show == 1')['age'], bins = range(0, 110, 10),  edgecolor = 'black', alpha = .6)
plt.title('\n\nMissed Appointments\n\n', fontsize = 14, loc = 'left')
ax2 = plt.subplot(122)
plt.hist(df.query('no_show == 0')['age'], bins = range(0, 110, 10), edgecolor = 'black', alpha = .6)
plt.title('\n\nAttended Appointments\n\n', fontsize = 14, loc = 'left')

sytle_plot_spines(ax1, ax2)
plt.xticks(range(0, 100,10));

It seems most of the misses come from the young 0-10 group, but most of the attendances also come from that same group.
People aged 10-40 have many misses but not as many attendances compared with the 0-10 group, so I expect the highest miss rates in that region, especially the 10-20 and 20-30 groups. I'll categorize the ages into real-life age groups and analyze the miss means.

In [46]:
## Categorizing the age column into groups
df['age_group'] = pd.cut(df['age'], bins=[0, 0.6, 2, 12, 18, 25, 40, 60, 200], 
       include_lowest=True, right = True, 
       labels=['new born', 'infant', 'child', 'teenager', 'young adult', 'adult', 'middle age', 'old'])
In [47]:
## grouping by age_group
age_df = groupby_and_agg_by_column(df, 'age_group')
age_df
Out[47]:
age_group sum mean count
0 new born 634 0.179909 3524
1 infant 664 0.171222 3878
2 child 2937 0.218624 13434
3 teenager 2003 0.259221 7727
4 young adult 2425 0.252473 9605
5 adult 4775 0.219651 21739
6 middle age 5467 0.184179 29683
7 old 2956 0.150839 19597
In [48]:
## Plotting the miss rate per age group (scholarship breakdown follows later)
group_order = ('new born', 'infant', 'child', 'teenager', 'young adult', 'adult', 'middle age', 'old')
func = lambda x: '\n' + x
group_order = tuple(func(i) for i in group_order)
plt.figure(figsize = (10, 5))
ax = plt.subplot(111)
sytle_plot_spines(ax)

cont = plt.bar(range(0, len(age_df['age_group']) * 8, 8), age_df['mean'], width = 5, color = '#0073FF')
cont[3].set_color('#75e504')

plt.xticks(range(0, len(age_df['age_group']) * 8, 8), group_order);
plt.xlabel('\nAge Group\n\n', fontsize = 12)
plt.ylabel('Probability of Missing the Appointment\n\n', fontsize = 12)
plt.title('\nMissing rate for each Age Group\n', fontsize = 14)
ax.set_facecolor('#f4f4f9')
plt.show()

It's clear that teenagers, young adults, adults and children show the highest rates, matching the 10-40 expectation from the initial look, with teenagers having the highest miss mean. Responsibility for children, and arguably teenagers, lies with their families, while young adults and adults answer for themselves, reflecting the medical awareness of individuals at different ages. But could financial problems be the main cause of absence instead? I'll investigate the scholarship column to pinpoint the problem.

In [49]:
## General investigation of the scholarship column
schol_df = groupby_and_agg_by_column(df, 'scholarship')
schol_df
Out[49]:
scholarship sum mean count
0 0 19331 0.196332 98461
1 1 2530 0.235875 10726

Generally, scholarships don't seem to affect the probability of attendance positively: the miss mean of scholarship recipients is higher than that of non-recipients. Is this the case across the different age groups?

In [50]:
temp = groupby_and_agg_by_column(df, ['age_group', 'scholarship'])
temp.head(10)
Out[50]:
age_group scholarship sum mean count
0 new born 0 627 0.180588 3472
1 new born 1 7 0.134615 52
2 infant 0 637 0.173192 3678
3 infant 1 27 0.135000 200
4 child 0 2512 0.216272 11615
5 child 1 425 0.233645 1819
6 teenager 0 1695 0.256974 6596
7 teenager 1 308 0.272325 1131
8 young adult 0 2137 0.249097 8579
9 young adult 1 288 0.280702 1026
In [51]:
plt.figure(figsize = (10,5))
ax = plt.subplot(111)
ticks = []
labels = list(range(4, 4 + 12 * 8, 12))
step = 12
pointer = 0
for i in range(int(temp.shape[0] / 2)):
    ticks += [pointer, pointer + 4]
    labels += [pointer + 4]
    pointer += step

sytle_plot_spines(ax)
bar_cont  = plt.bar(ticks, temp['mean'], width = 4, align = 'edge', alpha = .7)
patches = bar_cont.patches

for i in range(len(patches)):
    if i % 2 == 0:
        patches[i].set_fc('#a1f79b')
    else:
        patches[i].set_fc('#FFB000')
        
non_sponsored = mpatches.Patch(color='#a1f79b', label='Not Sponsored')
sponsored = mpatches.Patch(color ='#FFB000', label = 'Sponsored')
plt.legend(handles=[non_sponsored, sponsored], fontsize = 11)


plt.xticks(labels, group_order);
plt.title('\nAppointment Absence Mean by Scholarship\n', fontsize = 14);
plt.xlabel('\nAge Group\n\n', fontsize = 12)
plt.ylabel('Probability of Missing the Appointment\n\n', fontsize = 12)
Out[51]:
Text(0,0.5,'Probability of Missing the Appointment\n\n')

Although the rates for newborns and infants with scholarships are better than for those without, scholarships fail to explain the missing rates of the other age groups, meaning there is no direct, simple relationship between scholarship and attendance across age groups. That's why I relate attendance to medical awareness instead, though my results here are tentative: scholarship could still prove useful for predicting attendance via a neural network or another classification model.
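As a sketch of the modelling direction mentioned above, here is a minimal logistic regression in plain NumPy on synthetic data (the coefficients and data are invented for illustration; a real test would use the actual columns, and in practice a library such as scikit-learn would replace the hand-rolled gradient descent):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
scholarship = rng.integers(0, 2, n)
age = rng.uniform(0, 100, n)

# Synthetic ground truth: higher miss probability for younger, sponsored patients.
p = 1 / (1 + np.exp(-(-1.0 + 0.5 * scholarship - 0.02 * age)))
y = rng.random(n) < p

# Design matrix: intercept, scholarship flag, age scaled to [0, 1].
X = np.column_stack([np.ones(n), scholarship, age / 100])
w = np.zeros(3)
for _ in range(2000):                       # batch gradient descent on the log-loss
    grad = X.T @ (1 / (1 + np.exp(-X @ w)) - y) / n
    w -= 0.5 * grad

print(np.round(w, 2))  # fitted intercept, scholarship, and age coefficients
```

On this synthetic data the fitted scholarship coefficient comes out positive and the age coefficient negative, recovering the signs planted in the ground truth.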

Investigate by days to appointment

In [52]:
plt.figure(figsize=(6,6))
ax = plt.subplot(111)
g = sns.boxplot(x='no_show', y = 'days', data = df, zorder = 3.6);
g.set(ylim = (-10, 65))
ax.set_facecolor('#efeded')
plt.show()
In [53]:
df[df['days'] > 50].shape[0]
Out[53]:
3046

Although there are many possible outliers in the days column, their number is too big to exclude them completely. I'll categorize the days into same day, same week, up to one month, up to two months, and beyond.

In [54]:
df['interval_category'] = pd.cut(df['days'], bins=[0, .5, 7, 30, 60, 1000], 
       include_lowest=True, right = True, 
       labels=['same_day', 'same_week', 'same_month', 'two_months', 'beyond'])

df.head(3)
Out[54]:
patient_id gender scheduled_day appointment_day age neighbourhood scholarship hipertension diabetes alcoholism handcap sms_received no_show days age_group interval_category
0 29872499824296 0 2016-04-29 2016-04-29 62 JARDIM DA PENHA 0 1 0 0 0 0 0 0 old same_day
1 558997776694438 1 2016-04-29 2016-04-29 56 JARDIM DA PENHA 0 0 0 0 0 0 0 0 middle age same_day
2 4262962299951 0 2016-04-29 2016-04-29 62 MATA DA PRAIA 0 0 0 0 0 0 0 0 old same_day
In [55]:
days_df = groupby_and_agg_by_column(df, 'interval_category')
days_df
Out[55]:
interval_category sum mean count
0 same_day 1782 0.046295 38492
1 same_week 7676 0.240997 31851
2 same_month 9101 0.316227 28780
3 two_months 2727 0.340364 8012
4 beyond 575 0.280214 2052
In [56]:
plt.figure(figsize = (8,4))
sytle_plot_spines(plt.subplot(111))
plt.bar(days_df['interval_category'], days_df['mean'], alpha = .7)
plt.title('\nTime to Appointment by Time Category\n', fontsize = 14);
plt.xlabel('\nTime Category\n\n', fontsize = 12)
plt.ylabel('Probability of Missing the Appointment\n\n', fontsize = 12)
plt.show()

People who schedule the appointment on the same day have a very low miss rate, as expected, and it seems the longer the interval between scheduling and the appointment, the higher the miss mean. The 'beyond' group catches the outliers (intervals over 60 days); although its mean is lower than the 'same_month' and 'two_months' means, it is a relatively small group compared to the others, so it could be handled as a special case on its own.

Investigate by day of the week

In [57]:
df['day_of_week'] = df['appointment_day'].dt.dayofweek
df.head(3)
Out[57]:
patient_id gender scheduled_day appointment_day age neighbourhood scholarship hipertension diabetes alcoholism handcap sms_received no_show days age_group interval_category day_of_week
0 29872499824296 0 2016-04-29 2016-04-29 62 JARDIM DA PENHA 0 1 0 0 0 0 0 0 old same_day 4
1 558997776694438 1 2016-04-29 2016-04-29 56 JARDIM DA PENHA 0 0 0 0 0 0 0 0 middle age same_day 4
2 4262962299951 0 2016-04-29 2016-04-29 62 MATA DA PRAIA 0 0 0 0 0 0 0 0 old same_day 4
In [58]:
from datetime import datetime
import calendar 

by_dayofweek_df = groupby_and_agg_by_column(df, 'day_of_week')
day_mapper = dict(zip(range(7), calendar.day_name))
by_dayofweek_df['week_name'] = np.array([day_mapper[x] for x in by_dayofweek_df['day_of_week']])
by_dayofweek_df
Out[58]:
day_of_week sum mean count week_name
0 0 4596 0.204740 22448 Monday
1 1 5029 0.198688 25311 Tuesday
2 2 4987 0.195026 25571 Wednesday
3 3 3280 0.192049 17079 Thursday
4 4 3960 0.211324 18739 Friday
5 5 9 0.230769 39 Saturday

Basically, there are no appointments at all on Sundays, and on Saturdays there are too few appointments to support a workday-versus-weekend comparison. Consequently, there is no point in carrying this part of the analysis further.

Investigation by alcoholism and other medical conditions

First I will investigate each column's mean individually, then the interactions between the columns

In [59]:
def get_probability_of_no_show(df, columns, values):
    """
    takes binary-valued 'columns' and returns the miss rate
    for the combination of column values stored in 'values',
    along with the number of appointments matching the query
    """
    if len(columns) != len(values):
        raise ValueError('Number of columns not equal to number of values')
    else:
        query = ''
        for column, value in zip(columns, values):
            query += '{} == {} and '.format(column, value)
        query = query[:-4]
        temp_df = df.query(query)['no_show']
        return temp_df.mean(), temp_df.count() 
In [60]:
## Aggregate mean over entire group categories 

columns = ['alcoholism', 'diabetes', 'hipertension']

print('%24s'%'Mean', '%10s'%'Count')
for column in columns:
    print('%-18s'%column, '%-10.3f %d'%get_probability_of_no_show(df, [column],[1]))
                    Mean      Count
alcoholism         0.199      3330
diabetes           0.180      7847
hipertension       0.171      21552
In [61]:
def label_venn(v, df, columns, repeat):
    """
    takes a venn object 'v' (venn2 or venn3) and the binary columns
    represented in the diagram, and labels each region with its
    probability of missing and its number of samples
    """
    combinations = list(itertools.product([0, 1], repeat= repeat))
    combinations.remove((0,) * repeat)

    for comb in combinations:
        v.get_label_by_id(''.join(str(x) for x in comb)).set_text('%.3f/%d'%get_probability_of_no_show(df, columns, comb))
In [62]:
## Measuring the interaction between each 2-column combination
## of 'alcoholism', 'diabetes', 'hipertension'

from matplotlib_venn import venn2, venn2_circles
figure, axes = plt.subplots(1, 3)
figure.set_size_inches((16,5))

index = 0
for comb in list(itertools.combinations(columns, 2)):
    v = venn2(subsets=(1,1,1), set_labels = ['\n\n' + i +'\n\n' for i in comb], ax=axes[index])
    c = venn2_circles(subsets=(1, 1, 1), linestyle='-',  ax = axes[index])
    c[0].set_edgecolor('#1ef213')
    c[1].set_edgecolor('#29c3ce')
    v.get_patch_by_id('A').set_color('#1ef213')
    v.get_patch_by_id('B').set_color('#29c3ce')
    label_venn(v, df, comb, 2)
    index += 1
    
plt.annotate('\nMean of misses/Num of appointments\n', xy=v.get_label_by_id('100').get_position() - np.array([0, 0.05]), xytext=(-100,110),
             ha='center', textcoords='offset points', bbox=dict(boxstyle='round,pad=0.5', fc='gray', alpha=0.1),
             arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0.5',color='gray'))

del index
plt.show()

It seems that people with more than one condition, such as diabetes combined with alcoholism, have lower miss rates; however, that does not hold for hypertension.
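The region numbers in the venn diagrams can also be read off a plain groupby over each pair of binary condition columns; a sketch of an equivalent tabulation (`interaction_table` is a hypothetical helper name, not from the notebook):

```python
import pandas as pd

def interaction_table(df, col_a, col_b):
    """Miss rate and group size for every combination of two binary columns."""
    return (df.groupby([col_a, col_b])['no_show']
              .agg(['mean', 'count'])
              .reset_index())

# Toy frame: one appointment per condition combination
toy = pd.DataFrame({'diabetes':   [0, 0, 1, 1],
                    'alcoholism': [0, 1, 0, 1],
                    'no_show':    [0, 1, 1, 0]})
print(interaction_table(toy, 'diabetes', 'alcoholism'))
```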

Examining Interaction between all three columns at the same time ('alcoholism', 'diabetes', 'hipertension')

In [63]:
from matplotlib_venn import venn3, venn3_circles

plt.figure(figsize=(8,8))
v = venn3(subsets=(1, 1, 1, 1, 1, 1, 1), set_labels = ('Alcoholism', 'Diabetes', 'Hypertension'))

label_venn(v, df, columns, 3)

venn3_circles(subsets=(1, 1, 1, 1, 1, 1, 1), linestyle='--')

plt.title("\n\nAlcoholism, Diabetes, Hypertension Probabilistic Effect on Attendance\n\n", fontsize = 14)
plt.annotate('Mean of misses/Num of appointments', xy=v.get_label_by_id('100').get_position() - np.array([0, 0.05]), xytext=(-170,-70),
             ha='center', textcoords='offset points', bbox=dict(boxstyle='round,pad=0.5', fc='gray', alpha=0.1),
             arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0.5',color='gray'))
plt.show()

I will examine the remaining variables, handicap and SMS, for completeness of the analysis in case I missed anything important; then I will build a simple classification model with scikit-learn.

In [64]:
temp_df = groupby_and_agg_by_column(df, 'handcap')
print(temp_df)
plt.bar(temp_df['handcap'], temp_df['mean']);
   handcap    sum      mean   count
0        0  21466  0.200660  106977
1        1    354  0.175682    2015
2        2     37  0.204420     181
3        3      3  0.272727      11
4        4      1  0.333333       3

As the group sizes for handicap degrees 3 and 4 are very small, we should not consider their means significant.
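To make "not significant" concrete, here is a rough normal-approximation confidence interval for a proportion (a sketch; the approximation itself is unreliable for groups as small as n = 3, which underlines the point):

```python
import numpy as np

def prop_ci(p, n, z=1.96):
    """Rough normal-approximation 95% CI for a proportion."""
    se = np.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# Handicap degree 4: observed mean 1/3 from only 3 appointments.
# The interval spans far past the overall ~0.20 miss rate.
lo, hi = prop_ci(1 / 3, 3)
print(lo, hi)
```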

In [65]:
temp_df = groupby_and_agg_by_column(df, 'sms_received')
print(temp_df)
plt.bar(temp_df['sms_received'], temp_df['mean']);
plt.xticks(range(0,2));
   sms_received    sum      mean  count
0             0  12246  0.165085  74180
1             1   9615  0.274659  35007

Surprisingly, the miss rate among those who received SMS messages is higher than among those who didn't receive them.
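One plausible explanation, not verified here, is confounding with lead time: same-day appointments typically get no reminder, so the SMS group may skew toward longer scheduling intervals, which already miss more. A toy sketch of the check, with hypothetical numbers:

```python
import pandas as pd

# Hypothetical illustration: same-day appointments get no SMS, so the
# SMS group is skewed toward longer lead times (higher baseline miss rate)
toy = pd.DataFrame({
    'sms_received': [0, 0, 0, 1, 1, 1],
    'days':         [0, 0, 1, 7, 14, 30],
})
print(toy.groupby('sms_received')['days'].mean())
```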

Classification of no-show appointments based on age and days interval to appointment

In [66]:
import statsmodels.api as sm
print(df.columns)
df.head()
Index(['patient_id', 'gender', 'scheduled_day', 'appointment_day', 'age',
       'neighbourhood', 'scholarship', 'hipertension', 'diabetes',
       'alcoholism', 'handcap', 'sms_received', 'no_show', 'days', 'age_group',
       'interval_category', 'day_of_week'],
      dtype='object')
Out[66]:
patient_id gender scheduled_day appointment_day age neighbourhood scholarship hipertension diabetes alcoholism handcap sms_received no_show days age_group interval_category day_of_week
0 29872499824296 0 2016-04-29 2016-04-29 62 JARDIM DA PENHA 0 1 0 0 0 0 0 0 old same_day 4
1 558997776694438 1 2016-04-29 2016-04-29 56 JARDIM DA PENHA 0 0 0 0 0 0 0 0 middle age same_day 4
2 4262962299951 0 2016-04-29 2016-04-29 62 MATA DA PRAIA 0 0 0 0 0 0 0 0 old same_day 4
3 867951213174 0 2016-04-29 2016-04-29 8 PONTAL DE CAMBURI 0 0 0 0 0 0 0 0 child same_day 4
4 8841186448183 0 2016-04-29 2016-04-29 56 JARDIM DA PENHA 0 1 1 0 0 0 0 0 middle age same_day 4
In [67]:
df2 = df[['age' , 'days' , 'no_show']]
df2.head()
Out[67]:
age days no_show
0 62 0 0
1 56 0 0
2 62 0 0
3 8 0 0
4 56 0 0
In [68]:
## Getting the correlation coefficient between the age and days columns
np.corrcoef(df2['age'], df2['days'])[0,1]
Out[68]:
0.03546280597665654
In [69]:
plt.figure(figsize = (8,6))
sytle_plot_spines(plt.subplot(111))
plt.scatter(df2['age'], df2['days'],  alpha = .4)
plt.xlabel('\nAge', fontsize = 12)
plt.ylabel('Days to Appointment\n', fontsize = 12)
plt.title('\nDays to Appointment and Age Relation\n\n', fontsize = 15)
plt.show()

Based on the correlation coefficient and the scatter plot, the days and age columns seem (linearly) uncorrelated, so I will treat them as independent predictors.
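A correlation coefficient alone says nothing about significance; `scipy.stats.pearsonr` also returns a p-value. A sketch on synthetic stand-in columns (assumed independent draws, not the actual data):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)

# Synthetic stand-ins for the age and days columns, drawn independently
age = rng.randint(0, 100, 1000)
days = rng.randint(0, 100, 1000)

# r should be near 0 for independent draws
r, p = stats.pearsonr(age, days)
print(r, p)
```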

In [70]:
c = df2['no_show'].apply(lambda x: 'blue' if x == 1 else 'red')
plt.figure(figsize = (10, 7))
sytle_plot_spines(plt.subplot(111))
plt.ylim(0, 120)
plt.scatter(df2['age'], df2['days'],  alpha = .4,  c = c)

plt.xlabel('\nAge', fontsize = 12)
plt.ylabel('Days to Appointment\n', fontsize = 12)
plt.title('\nDays to Appointment and Age Relation\n\n', fontsize = 15)


showed = mpatches.Patch(color='red', label='Attended')
not_showed = mpatches.Patch(color ='blue', label = 'Missed')

plt.legend(handles = [showed, not_showed],
           loc = 'center', bbox_to_anchor=(.9, .3, 0.3, 1),
           title = 'No-Show', fontsize = 11)

plt.show()

Although the data are heavily overlapped, there are certain regions densely packed with misses and others densely packed with attendances. I am no machine learning expert, just a novice, but one way this classification might work is by dividing the data into smaller regions and classifying each subproblem individually. I will try classification with sklearn on both the whole data and the subdivided data and compare the accuracy.

In [71]:
## proportion of same-day appointments among all appointments

df2.query('days == 0').shape[0]/ df2.shape[0]
Out[71]:
0.3525328106825904
In [72]:
print(
    '%.3f'%df2.query('days == 0')['no_show'].mean(),
    '%.3f'%df2.query('days == 1')['no_show'].mean(), sep  = '\n'
)
0.046
0.213

One important point from the earlier analysis: a large proportion of the appointments were scheduled on the same day and had a very low miss rate. Looking at the previous plot, that single horizontal line at days = 0 (one out of more than 100 such lines) contains more than a third of the data with very few misses, which may gravely skew the classification of the other points, so I will check that.

In [73]:
## I will discard data points with days >= 100 (considering them outliers)
## and with age >= 100, to concentrate the classification on the more coherent bulk of the data

df2 = df2.query('days < 100 and age < 100')
df2.shape
Out[73]:
(109040, 3)
In [74]:
df2['intercept'] = 1
lmodel = sm.Logit(df2['no_show'], df2[['intercept', 'age', 'days']]).fit()
lmodel.summary()
Optimization terminated successfully.
         Current function value: 0.482023
         Iterations 5
Out[74]:
Logit Regression Results
Dep. Variable: no_show No. Observations: 109040
Model: Logit Df Residuals: 109037
Method: MLE Df Model: 2
Date: Thu, 31 Jan 2019 Pseudo R-squ.: 0.03697
Time: 18:00:43 Log-Likelihood: -52560.
converged: True LL-Null: -54578.
LLR p-value: 0.000
coef std err z P>|z| [0.025 0.975]
intercept -1.4391 0.015 -95.963 0.000 -1.468 -1.410
age -0.0077 0.000 -22.620 0.000 -0.008 -0.007
days 0.0285 0.000 61.012 0.000 0.028 0.029
  • The near-zero p-values for both the age and days coefficients indicate their significance; neither coefficient is plausibly 0.
  • For every one-year decrease in age, the odds of missing the appointment are multiplied by about 1.00773.
  • For every one-day increase in the interval, the odds of missing are multiplied by about 1.02829.
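The odds figures in the bullets above come from exponentiating the logit coefficients; a quick sketch using the rounded coefficients from the summary:

```python
import numpy as np

# Rounded coefficients from the fitted logit summary above
coefs = {'age': -0.0077, 'days': 0.0285}

# Odds multiplier on missing for a one-year *decrease* in age,
# and for a one-day *increase* in the scheduling interval
age_or_per_year_younger = np.exp(-coefs['age'])  # ~1.0077
days_or_per_extra_day = np.exp(coefs['days'])    # ~1.0289
print(age_or_per_year_younger, days_or_per_extra_day)
```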
In [75]:
df_exc_day0 = df2.query('days > 0')
lmodel = sm.Logit(df_exc_day0['no_show'], df_exc_day0[['intercept', 'age', 'days']]).fit()
lmodel.summary()
Optimization terminated successfully.
         Current function value: 0.589658
         Iterations 5
Out[75]:
Logit Regression Results
Dep. Variable: no_show No. Observations: 70550
Model: Logit Df Residuals: 70547
Method: MLE Df Model: 2
Date: Thu, 31 Jan 2019 Pseudo R-squ.: 0.01178
Time: 18:00:43 Log-Likelihood: -41600.
converged: True LL-Null: -42096.
LLR p-value: 3.830e-216
coef std err z P>|z| [0.025 0.975]
intercept -0.6867 0.018 -38.222 0.000 -0.722 -0.651
age -0.0099 0.000 -26.778 0.000 -0.011 -0.009
days 0.0084 0.001 16.033 0.000 0.007 0.009
  • The near-zero p-values for both the age and days coefficients indicate their significance; neither coefficient is plausibly 0.
  • For every one-year decrease in age, the odds of missing the appointment are multiplied by about 1.00995.
  • For every one-day increase in the interval, the odds of missing are multiplied by about 1.00844.
In [76]:
## Aggregate columns by days

temp = groupby_and_agg_by_column(df2, 'days')
temp.head()
Out[76]:
days sum mean count
0 0 1782 0.046298 38490
1 1 1098 0.212791 5160
2 2 1595 0.238166 6697
3 3 632 0.234248 2698
4 4 1218 0.231868 5253
In [77]:
temp.describe()
Out[77]:
days sum mean count
count 99.000000 99.000000 99.000000 99.000000
mean 49.000000 220.383838 0.291495 1101.414141
std 28.722813 349.629533 0.093967 3994.533425
min 0.000000 0.000000 0.000000 1.000000
25% 24.500000 17.000000 0.255822 57.000000
50% 49.000000 47.000000 0.313068 160.000000
75% 73.500000 308.000000 0.343135 971.000000
max 98.000000 1782.000000 0.448598 38490.000000
In [78]:
# comparing the shift in no_show mean between days

plt.figure(figsize=(6,4))
sytle_plot_spines(plt.subplot(111))

diffs = []
for i in range(len(temp['days']) - 1):
    diffs += [temp['mean'][i + 1] - temp['mean'][i]]
    
print('Diffs Mean:   %.7f' % np.mean(diffs))

plt.scatter(temp['days'][1:], diffs)
plt.ylim((-2 * np.std(diffs), 2 * np.std(diffs)))
plt.xlim((0, 100))

plt.title('\nDay to Day Missing Rate difference\n\n', fontsize = 14)
plt.xlabel('\nDays', fontsize = 12)
plt.ylabel('Means Differences\n', fontsize = 12);
Diffs Mean:   0.0015684

Comparing the model with and without day 0, we see a significant change in the days coefficient, from .0285 to .0084. This is reasonable: more than a third of the data sits at day 0, and the no-show mean shifts by .1665 between day 0 and day 1 (from 0.046298 to 0.212791), while the average day-to-day shift is only 0.0015684. Therefore, I will adopt the second model, excluding the day-0 data, since day 0 (with its no-show mean of .046) is already well predicted.

Classifying the data as a whole

In [79]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.svm import SVC

np.random.seed(42)
In [80]:
def train_model(df, X, y, method):
    """
    train model of data 'df' with 'X' columns as predictors 
    and 'y' columns as response using the sklearn model 'method'
    """
    X_train, X_test, y_train, y_test = train_test_split(df[X] , df[y], 
                                                    test_size = .2,
                                                    random_state = 42)
    
    model = method(random_state=42, tol=1e-5)
    model.fit(X_train, y_train)
    y_predict = model.predict(X_test)

    
    print(
        '\n'
        '%-20s'%'Precision' + '%.4f' % precision_score(y_test, y_predict),
        '%-20s'%'Recall' + '%.4f' % recall_score(y_test, y_predict),
        '%-20s'%'Accuracy' + '%.4f' % accuracy_score(y_test, y_predict), '\n',
        'Confusion matrix', confusion_matrix(y_test, y_predict),
        sep = '\n'
    )
In [81]:
train_model(df2, ['days', 'age'], 'no_show', LogisticRegression)
Precision           0.3199
Recall              0.0243
Accuracy            0.7999


Confusion matrix
[[17342   219]
 [ 4144   103]]
In [82]:
train_model(df_exc_day0, ['days', 'age'], 'no_show', LogisticRegression)
Precision           0.0000
Recall              0.0000
Accuracy            0.7213


Confusion matrix
[[10177     3]
 [ 3930     0]]

With a recall of 0.000, it's clear the model trained without day 0 is useless at predicting missed appointments. The model trained on the data including day 0 (recall .0243) is not good either, but it is still better.
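One common remedy for zero recall on an imbalanced target (not attempted in the notebook) is reweighting the minority class, e.g. `class_weight='balanced'` in sklearn's LogisticRegression. A sketch on synthetic imbalanced data (all values below are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(42)

# Synthetic data with a weak signal and ~20% positives,
# loosely mimicking the no-show class imbalance
X = rng.normal(size=(2000, 2))
y = (X[:, 0] + rng.normal(scale=2, size=2000) > 1.7).astype(int)

plain = LogisticRegression().fit(X, y)
balanced = LogisticRegression(class_weight='balanced').fit(X, y)

# The balanced model predicts the minority class far more often,
# trading precision for recall
print(plain.predict(X).mean(), balanced.predict(X).mean())
```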

In [83]:
train_model(df2, ['days', 'age'], 'no_show', LinearSVC)
Precision           0.3023
Recall              0.0490
Accuracy            0.7928


Confusion matrix
[[17081   480]
 [ 4039   208]]
In [84]:
train_model(df_exc_day0, ['days', 'age'], 'no_show', LinearSVC)
Precision           0.0000
Recall              0.0000
Accuracy            0.7215


Confusion matrix
[[10180     0]
 [ 3930     0]]
C:\Users\Abdel_000\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)

From the previous two analyses, it seems that excluding day 0 was a mistake, so I'll use df2 including day 0 to predict misses, after dividing the data into 9 smaller regions.

In [85]:
def plot_scatter(df):
    c = df['no_show'].apply(lambda x: 'blue' if x == 1 else 'red')
    plt.figure(figsize = (3, 3))
    sytle_plot_spines(plt.subplot(111))
    
    plt.scatter(df['age'], df['days'],  alpha = .4,  c = c)

    plt.xlabel('\nAge', fontsize = 12)
    plt.ylabel('Days to Appointment\n', fontsize = 12)
    plt.title('\nDays to Appointment and Age Relation\n\n', fontsize = 15)

    plt.show()
In [86]:
## Now I'll try dividing the data
for i in range(33,100,33):
    for j in range(33, 100, 33):
        temp_df = df2.query('age >= %d and age < %d and days >= %d and days < %d' % (j - 33, j, i - 33, i))
        train_model(
            temp_df,
            ['days', 'age'], 'no_show', 
            LinearSVC
        )
        plot_scatter(temp_df)
        print('==' * 50)
Precision           0.3828
Recall              0.3016
Accuracy            0.7488


Confusion matrix
[[6070  914]
 [1313  567]]
====================================================================================================
C:\Users\Abdel_000\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
Precision           0.0000
Recall              0.0000
Accuracy            0.8273


Confusion matrix
[[7281    0]
 [1520    0]]
====================================================================================================
C:\Users\Abdel_000\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
Precision           0.0000
Recall              0.0000
Accuracy            0.8577


Confusion matrix
[[2061    0]
 [ 342    0]]
====================================================================================================

Precision           0.0000
Recall              0.0000
Accuracy            0.6066


Confusion matrix
[[407   0]
 [264   0]]
C:\Users\Abdel_000\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
====================================================================================================

Precision           0.3114
Recall              1.0000
Accuracy            0.3114


Confusion matrix
[[  0 460]
 [  0 208]]
====================================================================================================

Precision           0.0000
Recall              0.0000
Accuracy            0.8000


Confusion matrix
[[124   0]
 [ 31   0]]
C:\Users\Abdel_000\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
C:\Users\Abdel_000\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
====================================================================================================

Precision           0.0000
Recall              0.0000
Accuracy            0.5000


Confusion matrix
[[28  0]
 [28  0]]
====================================================================================================
C:\Users\Abdel_000\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
Precision           0.0000
Recall              0.0000
Accuracy            0.7360


Confusion matrix
[[92  0]
 [33  0]]
====================================================================================================
C:\Users\Abdel_000\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
Precision           0.0000
Recall              0.0000
Accuracy            0.7971


Confusion matrix
[[55  0]
 [14  0]]
====================================================================================================

Many of the subgroup classification models are doing poorly, but there is some improvement in recall, as in the first group. Hence, we might further subdivide the groups, keeping a minimum group size and a desired recall ratio as constraints on the algorithm.
Finally, the results obtained here are tentative, and that concludes the analysis.
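The "subdivide into regions with a minimum group size" idea is essentially what a decision tree does automatically; a hedged sketch on synthetic stand-in data (the features, target rule, and parameter values below are all assumptions for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)

# Synthetic stand-ins for the (age, days) features and a binary no_show target
age = rng.randint(0, 100, 3000)
days = rng.randint(0, 100, 3000)
X = np.column_stack([age, days])
y = (rng.uniform(size=3000) < 0.02 + 0.004 * days / (1 + age / 50)).astype(int)

# min_samples_leaf enforces the "minimum size of a group" constraint;
# class_weight='balanced' pushes recall on the minority class
tree = DecisionTreeClassifier(min_samples_leaf=200, class_weight='balanced',
                              random_state=42).fit(X, y)
print(tree.get_n_leaves())
```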

Data Set Limitations.

  • Is the data provided by the dataset sufficient to answer your questions?
  • Is the size of the dataset sufficient to give a good judgement about the questions you asked?
  • What were the major difficulties you encountered in the dataset while performing the analysis?
  1. Some of the questions specific to patients missing the appointment can't be answered using this data. Examples:
    • Which time of day suffers the highest miss rates (morning, evening, rush hours, etc.): the specific appointment hours, minutes, and seconds aren't recorded; all are defaulted to 00:00:00.
    • Some information that might better predict miss rates for individuals, such as income or level of education, doesn't exist in the data.
    • The fact that the appointment_id was the same as the patient_id forces us to rely on the appointment_day column to infer the number of appointments a particular patient had, which is unreliable, since a patient could have had multiple appointments on the same day.
  2. The dataset has roughly 110,000 records, which is a good size to draw analysis from. However, when the data is categorized by certain attributes, such as neighbourhood or a rare medical condition like handicap, some groups end up with too few records for sound conclusions. Generally, the more data, the better the generality.
  3. Major difficulties:
    • In cleaning the data: duplicate scheduled_day values. Since every scheduled_day entry specifies the time down to the second (very fine granularity), duplicate entries in that column seemed strange. After investigation, I realised that those duplicates shared the same patient_id, with the later entries having a changed value in another column, such as sms_received or a medical condition. This can happen, for example, if the operator entered some information incorrectly, the patient corrected them, and the operator re-entered the patient's data with the corrected info and the same scheduled_day. That is why I chose to keep the last rows.
    • When performing the classification analysis of miss rates based on the independent variables age and days, the data was heavily overlapped. I tried to further categorize the data by one or more additional variables, such as scholarship, gender, or alcoholism, but it remained heavily overlapped. That is why I removed that step from the analysis and kept the classification based on only the age and days columns, subdividing the data into smaller regions.
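The keep-the-last-duplicate rule described above can be expressed with pandas `drop_duplicates`; a toy sketch of the assumed cleaning step:

```python
import pandas as pd

# Toy re-entry scenario: same patient and scheduled_day, corrected sms_received.
# Keeping the last record preserves the corrected entry (assumed cleaning rule).
toy = pd.DataFrame({
    'patient_id':    [111, 111],
    'scheduled_day': ['2016-04-29T18:38:08Z'] * 2,
    'sms_received':  [0, 1],
})
cleaned = toy.drop_duplicates(subset=['patient_id', 'scheduled_day'], keep='last')
print(cleaned)
```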